Unveiling the Secrets of Wine Quality¶
Bypassing traditional tasting methods, this project employs data analysis to predict the quality of wine based on its chemical features.
🍷 1. Introduction about the Data Set¶
📖 1.1 General Information:¶
- Provided by: The UCI Machine Learning Repository.
- Donated by: Paulo Cortez, Antonio Cerdeira, Fernando Almeida, Telmo Matos, and Jose Reis.
- 📜 Their paper: Modeling wine preferences by data mining from physicochemical properties.
- 🧐 Main idea: Using data mining to understand how various factors influence wine quality, offering insights into wine production and certification.
- ⚒️ Approach: Support Vector Machines (SVM), Neural Networks (NN), and Multiple Regression (MR) techniques.
- 🧰 Conclusion:
- For assessing wine quality, the Support Vector Machine (SVM) method outperforms other techniques in accuracy, especially for white wines.
- Alcohol level is a key factor in determining wine quality. Citric acid and residual sugar are more significant in white wines, whereas sulphates are highly important in both types.
🍇 1.2 Info about the Wine:¶
- Types: Both white and red wines from the Vinho Verde region in northwestern Portugal 🇵🇹.
- Production: Vinho Verde accounts for about 15% of total Portuguese wine production.
📊 1.3 Info about the Datasets:¶
- Wines: 1599 red and 4898 white samples.
- Collection:
- ⏳ Timeframe: May 2004 to February 2007.
- 🏷️ Type: Only protected designation of origin samples by CVRVV (Comissão de Viticultura da Região dos Vinhos Verdes), focused on enhancing the quality and marketing of vinho verde.
- Quality Assessment:
- Rated by at least three sensory assessors in blind tastings, on a scale from 0 (very bad) to 10 (excellent). The final score is the median of these ratings.
- Chemical Features Tested:
- 🧪 Data recorded by iLab, a computerized system managing wine sample testing.
- Tests include density, alcohol, pH values, etc.
- Limitation:
- Lack of Temporal Information:
- We cannot analyze how wine quality varies across years, nor can we relate weather conditions to wine quality.
- Lack of Brand and Public Preference Data:
- We are unable to establish a direct link between wine quality attributes and consumer preferences or sales performance.
2. Research Questions and Motivations¶
2.1 Research Questions¶
Our research questions build on and extend prior studies by:
🎯 Focusing on Classification: Utilizing advanced models like logistic regression and Random Forest for classifying wine quality tiers, assessing their accuracy, and pinpointing crucial quality influencers.
🛠️ Model Comparison Pipeline: Developing a systematic pipeline to contrast various models. This includes tuning hyperparameters and evaluating performance scores.
🍇 Quality Wine Recipes: Crafting formulas for both top-quality and poor-quality wines. These models aim to avert the production of low-quality wines and spotlight the unique attributes of top-tier wines.
🔍 Deep Dive with PCA: Investigating nuances in high-rated wines and applying Principal Component Analysis (PCA) for a more thorough exploration, surpassing traditional data mining approaches.
2.2 Motivations:¶
🇫🇷 Cultural Significance: Residing in France, a nation celebrated for its wine tradition, we seek to deepen our understanding of wine. This analysis fosters a greater appreciation of this heritage.
📊 Analytical Depth: Leveraging data-driven methods to explore wine quality nuances. This exploration will enhance our analytical skills while shedding light on hidden characteristics within wines.
🍾 Enhancing Wine Production: Providing actionable insights for quality improvement through advanced statistical and machine learning techniques.
3. Data Analysis¶
from ucimlrepo import fetch_ucirepo
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from scipy.stats import zscore
3.1 Extract Data: Reading from CSV Files¶
# These csv files are downloaded from the UCI website.
df_white_wine = pd.read_csv("data/winequality-white.csv", sep=";")
df_red_wine = pd.read_csv("data/winequality-red.csv",sep=";")
df_red_wine
df_white_wine
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4893 | 6.2 | 0.21 | 0.29 | 1.6 | 0.039 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 |
| 4894 | 6.6 | 0.32 | 0.36 | 8.0 | 0.047 | 57.0 | 168.0 | 0.99490 | 3.15 | 0.46 | 9.6 | 5 |
| 4895 | 6.5 | 0.24 | 0.19 | 1.2 | 0.041 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 |
| 4896 | 5.5 | 0.29 | 0.30 | 1.1 | 0.022 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 |
| 4897 | 6.0 | 0.21 | 0.38 | 0.8 | 0.020 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 |
4898 rows × 12 columns
df_white_wine.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
df_red_wine.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
wine_lists = [df_red_wine, df_white_wine]
df_white_wine.wine_type = "White Wine"
df_red_wine.wine_type = "Red Wine"
def get_wine_str(wine_type_df):
    # Fall back to a generic label when no wine_type attribute was set
    return getattr(wine_type_df, 'wine_type', "Unknown Wine")
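Setting `wine_type` as a plain attribute works in this notebook, but pandas does not carry ad-hoc attributes over to copies, which is why the label has to be reassigned after every `copy()`. A minimal sketch of an alternative, assuming the experimental `DataFrame.attrs` dictionary is acceptable here; the two-row frame and the helper name `get_wine_label` are hypothetical and not part of the original notebook:

```python
import pandas as pd

# Hypothetical two-row frame standing in for the red wine data
df_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})

# attrs is a plain dict attached to the frame (pandas marks it as experimental)
df_demo.attrs["wine_type"] = "Red Wine"


def get_wine_label(df):
    """Return the stored label, or a fallback when none was set."""
    return df.attrs.get("wine_type", "Unknown Wine")


df_copy = df_demo.copy()  # attrs are propagated to the copy

label_original = get_wine_label(df_demo)
label_copy = get_wine_label(df_copy)
label_missing = get_wine_label(pd.DataFrame())
print(label_original, label_copy, label_missing)
```

Whether `attrs` survives every pandas operation is not guaranteed, so an explicit dictionary of `{label: DataFrame}` pairs would be the more conservative choice.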
3.2 Transformation¶
Here we create QQ plots to check whether each feature follows a normal distribution.
def check_numeric_columns(wine_type_df):
    return wine_type_df.select_dtypes(include=[np.number]).columns
## Q-Q plots
def create_qq_plot(wine_type_df):
    wine_type = get_wine_str(wine_type_df)
    # Select only the numerical columns from the DataFrame
    numeric_columns = check_numeric_columns(wine_type_df)
    # Set up the matplotlib figure and axes for a 3x4 grid (one panel per numeric column)
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))  # Adjust the size as needed
    # Flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()
    # Loop over the numerical columns and create a Q-Q plot for each
    for i, column in enumerate(numeric_columns):
        data = wine_type_df[column]
        stats.probplot(data, dist="norm", plot=axs[i])
        axs[i].set_title(column)
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')
    # Add a figure title and adjust the layout to prevent overlap
    fig.suptitle(f"QQ Plots for {wine_type}", fontsize=16)
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.show()
def create_plots(plot_function, dfs=wine_lists):
    # Apply the given function to every DataFrame (avoids shadowing the built-in list)
    for df in dfs:
        plot_function(df)
create_plots(create_qq_plot, wine_lists)
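A QQ plot can also be summarized numerically. The sketch below is an illustration on synthetic data, not on the wine datasets: it calls `stats.probplot` without a `plot` argument and reads the correlation coefficient `r` of the fitted line, which stays close to 1 for normal data and drops for skewed data. The sample sizes, the seed, and the helper name `qq_correlation` are assumptions made for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_sample = rng.normal(loc=10.0, scale=1.0, size=1000)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)


def qq_correlation(sample):
    # Without a plot argument, probplot only returns the numbers:
    # (theoretical quantiles, ordered values), (slope, intercept, r)
    (_, _), (_, _, r) = stats.probplot(sample, dist="norm")
    return r


r_normal = qq_correlation(normal_sample)
r_skewed = qq_correlation(skewed_sample)
print(f"normal: r = {r_normal:.3f}, lognormal: r = {r_skewed:.3f}")
```

The coefficient is only a rough summary; the plots remain the better tool for seeing where in the distribution the deviation occurs.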
Observations from QQ Plots of Red and White Wine Datasets:¶
Deviations in the lower quantiles:
- Red Wine: Residual Sugar and Alcohol.
- White Wine: Residual Sugar and Alcohol, as in Red Wine.
Deviations in the upper quantiles:
- Red Wine: Free Sulfur Dioxide, Chlorides, and Sulphates.
- White Wine: Volatile Acidity, Chlorides, and Sulphates.
Note that deviations in the upper quantiles indicate a long right tail. All of these features have positive skewness coefficients (see below), so they are right-skewed rather than left-skewed.
Implications for Data Processing:
- The observed skewness in both datasets suggests that normalizing transformations are needed. As a next step, we calculate the skewness coefficient of each feature.
- Techniques like logarithmic or Box-Cox transformations may be beneficial to address these deviations and improve the homogeneity of the data.
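To make the two transformations named above concrete, here is a minimal sketch on synthetic, strictly positive data (Box-Cox requires positive inputs). It is not run on the wine data; the lognormal sample and the seed are assumptions chosen only to produce a right-skewed feature.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=0.8, size=2000)  # strictly positive, right-skewed

x_log = np.log(x)                          # plain log transformation
x_boxcox, fitted_lambda = stats.boxcox(x)  # lambda is chosen by maximum likelihood

skew_raw = stats.skew(x)
skew_log = stats.skew(x_log)
skew_boxcox = stats.skew(x_boxcox)
print(f"raw: {skew_raw:.2f}, log: {skew_log:.2f}, "
      f"box-cox (lambda={fitted_lambda:.2f}): {skew_boxcox:.2f}")
```

Box-Cox fits its exponent to the data, so it can adapt to features where a plain log would over-correct, at the cost of a less interpretable scale.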
def calculate_skewness_coefficient(wine_type_df):
    print(f"\nThe skewness coefficient of {get_wine_str(wine_type_df)}: \n")
    numerical_columns = wine_type_df.select_dtypes(include=['number']).columns
    for column in numerical_columns:
        skewness = round(wine_type_df[column].skew(), 2)
        print(f"{column}: {skewness}")
create_plots(calculate_skewness_coefficient, wine_lists)
The skewness coefficient of Red Wine:
fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 4.54
chlorides: 5.68
free sulfur dioxide: 1.25
total sulfur dioxide: 1.52
density: 0.07
pH: 0.19
sulphates: 2.43
alcohol: 0.86
quality: 0.22

The skewness coefficient of White Wine:
fixed acidity: 0.65
volatile acidity: 1.58
citric acid: 1.28
residual sugar: 1.08
chlorides: 5.02
free sulfur dioxide: 1.41
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16
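For reference, the coefficient returned by `Series.skew()` is the sample-size-adjusted moment coefficient of skewness. The sketch below recomputes it by hand on a handful of made-up numbers (not taken from the datasets) and checks it against pandas and against `scipy.stats.skew` with `bias=False`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd
from scipy import stats

values = pd.Series([1.9, 2.6, 2.3, 1.9, 1.8, 6.1, 2.0, 15.5])  # made-up numbers

n = len(values)
deviations = values - values.mean()
m2 = (deviations ** 2).mean()              # second central moment
m3 = (deviations ** 3).mean()              # third central moment
g1 = m3 / m2 ** 1.5                        # biased moment coefficient
G1 = g1 * np.sqrt(n * (n - 1)) / (n - 2)   # sample-size adjustment

skew_pandas = values.skew()
skew_scipy = stats.skew(values, bias=False)
print(round(G1, 4), round(skew_pandas, 4), round(skew_scipy, 4))
```

A positive value means the right tail is longer; a value near zero means the distribution is roughly symmetric.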
Observations from the skewness coefficient
- White Wine:
- Chlorides (5.02): High skewness.
- Volatile Acidity (1.58), Citric Acid (1.28), Residual Sugar (1.08), Free Sulfur Dioxide (1.41): Moderate skewness.
- Red Wine:
- Residual Sugar (4.54) and Chlorides (5.68): High skewness.
- Free Sulfur Dioxide (1.25), Total Sulfur Dioxide (1.52), Sulphates (2.43): Moderate skewness.
We apply a log transformation to the features with high or moderate skewness.
white_wine_log_columns = [
'chlorides',
'volatile acidity',
'citric acid',
'residual sugar',
'free sulfur dioxide'
]
red_wine_log_columns = [
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'sulphates'
]
log_red_wine_df = df_red_wine.copy()
log_white_wine_df = df_white_wine.copy()
log_red_wine_df[red_wine_log_columns] = np.log(log_red_wine_df[red_wine_log_columns] + 0.001)
log_white_wine_df[white_wine_log_columns] = np.log(log_white_wine_df[white_wine_log_columns]+ 0.001)
log_red_wine_df.wine_type = "Red Wine(Log)"
log_white_wine_df.wine_type = "White Wine(Log)"
log_dfs = [log_red_wine_df, log_white_wine_df]
create_plots(create_qq_plot, log_dfs)
create_plots(calculate_skewness_coefficient, log_dfs)
The skewness coefficient of Red Wine(Log):
fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 1.81
chlorides: 1.79
free sulfur dioxide: -0.23
total sulfur dioxide: -0.08
density: 0.07
pH: 0.19
sulphates: 0.92
alcohol: 0.86
quality: 0.22

The skewness coefficient of White Wine(Log):
fixed acidity: 0.65
volatile acidity: 0.14
citric acid: -5.56
residual sugar: -0.16
chlorides: 1.19
free sulfur dioxide: -0.94
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16
Observations based on the first log transformation:
Red Wine (Original vs Log-Transformed):
- Original: Residual sugar 4.54, Chlorides 5.68.
- Log-Transformed: Residual sugar 1.81, Chlorides 1.79. Free sulfur dioxide changed from positive (1.25) to slightly negative skewness (-0.23).
White Wine (Original vs Log-Transformed):
- Original: Volatile acidity 1.58, Citric acid 1.28.
- Log-Transformed: Volatile acidity 0.14, Citric acid -5.56 (over-correction). Residual sugar reduced from 1.08 to -0.16, Chlorides from 5.02 to 1.19.
Minimal Impact on Some Variables:
- Alcohol and quality were not log-transformed, so their skewness is unchanged (0.86 and 0.22 for Red Wine, 0.49 and 0.16 for White Wine).
Avoid Log Transformation For:
- Red Wine: Free sulfur dioxide and Total sulfur dioxide.
- White Wine: Citric acid and Residual sugar.
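One plausible cause of the citric acid over-correction is the `+ 0.001` offset: if the column contains exact zeros, they are mapped to log(0.001), far below every other value, which creates a long left tail. The sketch below illustrates the effect on made-up values (not the real column) and contrasts it with `np.log1p`, named here only as a possible alternative.

```python
import numpy as np

# Made-up values with exact zeros, standing in for a citric-acid-like feature
citric_like = np.array([0.0, 0.0, 0.04, 0.19, 0.30, 0.36, 0.56])

shifted_log = np.log(citric_like + 0.001)  # the offset used above
log1p_values = np.log1p(citric_like)       # log(1 + x), maps 0 to exactly 0

zero_after_shift = shifted_log[0]
zero_after_log1p = log1p_values[0]
gap_shifted = shifted_log[2] - shifted_log[0]   # distance between 0.0 and 0.04 after the shift
gap_log1p = log1p_values[2] - log1p_values[0]
print(f"zero maps to {zero_after_shift:.2f} (shifted log) vs {zero_after_log1p:.2f} (log1p)")
print(f"gap between 0.0 and 0.04: {gap_shifted:.2f} vs {gap_log1p:.3f}")
```

Note that `log1p` compresses values below 1 only slightly, so it avoids the artificial left tail but reduces right skew less than a plain log would.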
white_wine_update_log_columns = [
'chlorides',
'volatile acidity'
]
red_wine_update_log_columns = [
'residual sugar',
'chlorides'
]
log_red_update_wine_df = df_red_wine.copy()
log_white_update_wine_df = df_white_wine.copy()
log_red_update_wine_df[red_wine_update_log_columns] = np.log(log_red_update_wine_df[red_wine_update_log_columns] + 0.001)
log_white_update_wine_df[white_wine_update_log_columns] = np.log(log_white_update_wine_df[white_wine_update_log_columns]+ 0.001)
log_red_update_wine_df.wine_type = "Red Wine(Second Log)"
log_white_update_wine_df.wine_type = "White Wine(Second Log)"
log_update_dfs = [log_white_update_wine_df, log_red_update_wine_df]
create_plots(create_qq_plot, log_update_dfs)
create_plots(calculate_skewness_coefficient, log_update_dfs)
The skewness coefficient of White Wine(Second Log):
fixed acidity: 0.65
volatile acidity: 0.14
citric acid: 1.28
residual sugar: 1.08
chlorides: 1.19
free sulfur dioxide: 1.41
total sulfur dioxide: 0.39
density: 0.98
pH: 0.46
sulphates: 0.98
alcohol: 0.49
quality: 0.16

The skewness coefficient of Red Wine(Second Log):
fixed acidity: 0.98
volatile acidity: 0.67
citric acid: 0.32
residual sugar: 1.81
chlorides: 1.79
free sulfur dioxide: 1.25
total sulfur dioxide: 1.52
density: 0.07
pH: 0.19
sulphates: 2.43
alcohol: 0.86
quality: 0.22
Observations:
- White Wine:
- Log transformation significantly reduced skewness in volatile acidity (from 1.58 to 0.14) and chlorides (from 5.02 to 1.19).
- Red Wine:
- Effective reduction in skewness for residual sugar (from 4.54 to 1.81) and chlorides (from 5.68 to 1.79).
- Conclusion:
- The second log transformation was successful in reducing high skewness for key variables in both Red and White Wine datasets.
3.3 Clean Data¶
3.3.1 Check for missing values¶
We are going to check if there are empty values.
def check_na(wine_type_df):
    print(f'{get_wine_str(wine_type_df)}')
    print(wine_type_df.info())
    print(wine_type_df.isnull().sum())
create_plots(check_na, log_update_dfs)
White Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Red Wine(Second Log)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64
Observations from Checking Missing Values:
- There are no missing data in either dataset.
- All feature columns are of the float type, and the target column is an integer.
3.3.2 Outlier Analysis¶
We will create box plots to understand the outliers.
def create_box_plot(wine_type_df):
    feature_columns = wine_type_df.drop(columns=['quality']).columns.tolist()
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))
    # Flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()
    for i, column in enumerate(feature_columns):
        sns.boxplot(x='quality', y=column, hue='quality', data=wine_type_df, ax=axs[i], palette='dark:.8', legend=False)
    fig.suptitle(f'Box Plots for {get_wine_str(wine_type_df)}', fontsize=16)
    plt.show()
create_plots(create_box_plot, log_update_dfs)
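The points drawn beyond the whiskers of a box plot follow the usual 1.5 × IQR rule, which is the default in matplotlib and seaborn. A minimal sketch of that rule, useful for counting flagged points instead of reading them off the plot; the helper name `iqr_outlier_mask` and the sample values are hypothetical:

```python
import numpy as np


def iqr_outlier_mask(values, whis=1.5):
    """Flag points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)


sample = [1.0, 2.0, 3.0, 4.0, 100.0]   # made-up values
mask = iqr_outlier_mask(sample)
n_flagged = int(mask.sum())
print(mask.tolist(), n_flagged)
```

Applied per quality score with a `groupby`, the same mask would give outlier counts that match what the box plots show visually.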
Observations from Box Plots
- It is hard to decide from these plots alone whether outliers should be removed.
- To better understand the outliers, we create separate QQ plots for wines grouped into three quality categories: poor (3-4), middle (5-7), and good (8-9).
bin_edges = [2, 4, 7, 10]
bin_labels = ['poor', 'middle', 'good']
red_wine_quality_df = log_red_update_wine_df.copy()
white_wine_quality_df = log_white_update_wine_df.copy()
red_wine_quality_df['quality_category'] = pd.cut(red_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
white_wine_quality_df['quality_category'] = pd.cut(white_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
df_red_poor = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'poor']
df_red_middle = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'middle']
df_red_good = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'good']
df_white_poor = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'poor']
df_white_middle = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'middle']
df_white_good = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'good']
wine_quality_dfs = [
df_white_poor,
df_white_middle,
df_white_good,
df_red_poor,
df_red_middle,
df_red_good
]
df_white_poor.wine_type = 'White Wine Poor'
df_white_middle.wine_type = 'White Wine Middle'
df_white_good.wine_type = 'White Wine Good'
df_red_poor.wine_type = 'Red Wine Poor'
df_red_middle.wine_type = 'Red Wine Middle'
df_red_good.wine_type = 'Red Wine Good'
create_plots(create_qq_plot, wine_quality_dfs)
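The binning above relies on `pd.cut` treating each interval as open on the left and closed on the right by default, so the edges `[2, 4, 7, 10]` produce (2, 4], (4, 7] and (7, 10]. A small check on a hypothetical series of scores:

```python
import pandas as pd

scores = pd.Series([3, 4, 5, 6, 7, 8, 9])   # one example per observed quality score
categories = pd.cut(scores, bins=[2, 4, 7, 10], labels=["poor", "middle", "good"])

category_list = categories.astype(str).tolist()
counts = categories.astype(str).value_counts().to_dict()
print(category_list)
print(counts)
```

Scores 3 and 4 fall into poor, 5 to 7 into middle, and 8 and above into good, matching the categories described earlier.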
To understand distribution asymmetries within each quality category, we use the skewness coefficient to identify which variables deviate from normality.
create_plots(calculate_skewness_coefficient, wine_quality_dfs)
The skewness coefficient of White Wine Poor:
fixed acidity: 0.86
volatile acidity: 0.33
citric acid: 0.43
residual sugar: 1.07
chlorides: 1.17
free sulfur dioxide: 4.48
total sulfur dioxide: 1.07
density: 0.23
pH: 0.56
sulphates: 0.68
alcohol: 0.61
quality: -2.53

The skewness coefficient of White Wine Middle:
fixed acidity: 0.62
volatile acidity: 0.06
citric acid: 1.38
residual sugar: 1.07
chlorides: 1.22
free sulfur dioxide: 0.64
total sulfur dioxide: 0.31
density: 1.01
pH: 0.47
sulphates: 0.99
alcohol: 0.52
quality: 0.19

The skewness coefficient of White Wine Good:
fixed acidity: -0.4
volatile acidity: 0.07
citric acid: 0.52
residual sugar: 0.86
chlorides: 0.48
free sulfur dioxide: 1.47
total sulfur dioxide: 0.56
density: 1.14
pH: 0.06
sulphates: 0.98
alcohol: -0.92
quality: 5.8

The skewness coefficient of Red Wine Poor:
fixed acidity: 0.89
volatile acidity: 0.65
citric acid: 1.56
residual sugar: 1.36
chlorides: 2.34
free sulfur dioxide: 1.38
total sulfur dioxide: 1.29
density: 0.44
pH: -0.31
sulphates: 4.47
alcohol: 0.56
quality: -1.91

The skewness coefficient of Red Wine Middle:
fixed acidity: 1.0
volatile acidity: 0.5
citric acid: 0.28
residual sugar: 1.84
chlorides: 1.74
free sulfur dioxide: 1.25
total sulfur dioxide: 1.51
density: 0.09
pH: 0.21
sulphates: 2.31
alcohol: 0.86
quality: 0.52

The skewness coefficient of Red Wine Good:
fixed acidity: 0.04
volatile acidity: 1.72
citric acid: -0.39
residual sugar: 1.38
chlorides: -1.17
free sulfur dioxide: 1.49
total sulfur dioxide: 1.34
density: -0.24
pH: 0.42
sulphates: 1.46
alcohol: -0.23
quality: 0
To pinpoint extreme outliers, we use a threshold of |Z-score| > 5, which focuses on the most anomalous data points in each quality category.
def calculate_z_score(wine_type_df):
    z_scores_df = wine_type_df.copy()
    print(f"\nZ Score of {get_wine_str(wine_type_df)}:\n")
    for col in wine_type_df.columns:
        if col != 'quality' and col != 'quality_category':
            z_scores_df[col + ' Z-score'] = zscore(wine_type_df[col])
            outliers = abs(z_scores_df[col + ' Z-score']) > 5
            print(f"Outliers in {col}: {outliers.sum()}")
    return z_scores_df
create_plots(calculate_z_score, wine_quality_dfs)
Z Score of White Wine Poor:
Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 1
Outliers in free sulfur dioxide: 1
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0

Z Score of White Wine Middle:
Outliers in fixed acidity: 1
Outliers in volatile acidity: 0
Outliers in citric acid: 8
Outliers in residual sugar: 1
Outliers in chlorides: 6
Outliers in free sulfur dioxide: 2
Outliers in total sulfur dioxide: 0
Outliers in density: 3
Outliers in pH: 0
Outliers in sulphates: 2
Outliers in alcohol: 0

Z Score of White Wine Good:
Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0

Z Score of Red Wine Poor:
Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 1
Outliers in alcohol: 0

Z Score of Red Wine Middle:
Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 6
Outliers in chlorides: 12
Outliers in free sulfur dioxide: 1
Outliers in total sulfur dioxide: 2
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 7
Outliers in alcohol: 0

Z Score of Red Wine Good:
Outliers in fixed acidity: 0
Outliers in volatile acidity: 0
Outliers in citric acid: 0
Outliers in residual sugar: 0
Outliers in chlorides: 0
Outliers in free sulfur dioxide: 0
Outliers in total sulfur dioxide: 0
Outliers in density: 0
Outliers in pH: 0
Outliers in sulphates: 0
Outliers in alcohol: 0
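When reading these counts, it helps to know that `scipy.stats.zscore` uses the population standard deviation (`ddof=0`) by default, and that with this definition no point in a group of n samples can have |Z| larger than sqrt(n - 1). A group with 26 or fewer samples can therefore never exceed the threshold of 5, which may contribute to the all-zero counts in small categories. The sketch below demonstrates the bound on a constructed worst case; the helper name `max_abs_zscore` and the group sizes 18 and 27 are chosen only for illustration.

```python
import numpy as np
from scipy.stats import zscore


def max_abs_zscore(n):
    """Largest |Z| a single extreme point can reach among n samples."""
    values = np.zeros(n)
    values[-1] = 1.0   # one extreme point, all others identical
    return np.abs(zscore(values)).max()


max_z_18 = max_abs_zscore(18)   # sqrt(17), about 4.12
max_z_27 = max_abs_zscore(27)   # sqrt(26), about 5.10
print(f"n=18: {max_z_18:.2f}, n=27: {max_z_27:.2f}")
```

For small categories, a lower threshold or a robust measure such as the median absolute deviation may be more informative than a fixed cutoff of 5.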
Observations from Q-Q Plot, Skewness Coefficient, and Z-score
White Wine:¶
- Poor Quality:
- High skewness: free sulfur dioxide (4.48), chlorides (1.17).
- Outliers (|Z| > 5): chlorides (1), free sulfur dioxide (1).
- Middle Quality:
- Skewness: citric acid (1.38), chlorides (1.22), density (1.01).
- Most outliers (|Z| > 5): citric acid (8), chlorides (6), density (3).
- Good Quality:
- Skewness: free sulfur dioxide (1.47), density (1.14), sulphates (0.98).
- No outliers above the threshold.
Red Wine:¶
- Poor Quality:
- High skewness: sulphates (4.47), chlorides (2.34).
- Outliers (|Z| > 5): sulphates (1).
- Middle Quality:
- Skewness: sulphates (2.31), residual sugar (1.84), chlorides (1.74).
- Most outliers (|Z| > 5): chlorides (12), sulphates (7), residual sugar (6).
- Good Quality:
- Skewness: volatile acidity (1.72), sulphates (1.46).
- No outliers above the threshold.
We create heatmaps to understand the relationship between the features and the wine quality score.
def create_corr_matrix(wine_type_df):
    numeric_columns = check_numeric_columns(wine_type_df)
    #print(f'Correlation matrix for {get_wine_str(wine_type_df)}')
    #print(wine_type_df[numeric_columns].corr())
    return wine_type_df[numeric_columns].corr()
cmap = sns.diverging_palette(230, 20, as_cmap=True)
def create_heat_map(wine_type_df):
    correlation_matrix = create_corr_matrix(wine_type_df)
    plt.figure(figsize=(15, 10))
    # Hide cells whose absolute correlation is below 0.05
    mask_low_corr = np.abs(correlation_matrix) < 0.05
    sns.heatmap(correlation_matrix, annot=True, cmap=cmap, mask=mask_low_corr, linewidths=.5, vmax=1, vmin=-1)
    plt.title(f'Correlation matrix for {get_wine_str(wine_type_df)}')
    plt.xticks(rotation=45)
    plt.yticks(rotation=45)
    plt.show()
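def create_clustermap(wine_type_df):
    correlation_matrix = create_corr_matrix(wine_type_df)
    # A constant column (e.g. quality in Red Wine Good) has undefined (NaN) correlations,
    # which hierarchical clustering cannot handle, so drop all-NaN rows and columns first
    correlation_matrix = correlation_matrix.dropna(axis=0, how='all').dropna(axis=1, how='all')
    # Hide cells whose absolute correlation is below 0.05
    mask_low_corr = np.abs(correlation_matrix) < 0.05
    # clustermap creates its own figure, so the size is passed directly
    grid = sns.clustermap(correlation_matrix, annot=True, mask=mask_low_corr, cmap=cmap, linewidths=.5, vmax=1, vmin=-1, figsize=(15, 15))
    grid.figure.suptitle(f'Correlation matrix for {get_wine_str(wine_type_df)}')
    plt.setp(grid.ax_heatmap.get_xticklabels(), rotation=45)
    plt.setp(grid.ax_heatmap.get_yticklabels(), rotation=45)
    plt.show()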
create_plots(create_heat_map, wine_quality_dfs)
create_clustermap(df_red_good)
Output of the recorded run (abridged): the printed mask is False everywhere except for three weakly correlated pairs (volatile acidity / free sulfur dioxide, residual sugar / sulphates, total sulfur dioxide / sulphates), and the call ends with:
ValueError: The condensed distance matrix must contain only finite values.
The error is raised by scipy's hierarchy.linkage, which sns.clustermap calls internally. In Red Wine Good the quality column appears to be constant (its skewness above is 0), so every correlation involving quality is NaN, and hierarchical clustering cannot handle NaN entries. Rows and columns that are entirely NaN therefore need to be dropped from the correlation matrix before clustering.
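The failure mode can be reproduced without the wine data. The sketch below builds a small synthetic frame with one constant column, shows that its correlation matrix contains NaN, and then clusters the cleaned matrix with the same scipy routine that `sns.clustermap` relies on. The column values, the seed, and the use of `method="average"` (seaborn's default linkage) are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(12.0, 1.0, 30),
    "pH": rng.normal(3.3, 0.1, 30),
    "sulphates": rng.normal(0.7, 0.1, 30),
    "quality": np.full(30, 8),   # constant column, as in a single-score subset
})

corr = demo.corr()
has_nan = bool(corr.isna().any().any())   # the quality row and column are NaN

# Drop rows and columns that are entirely NaN before clustering
corr_clean = corr.dropna(axis=0, how="all").dropna(axis=1, how="all")
linkage_matrix = hierarchy.linkage(corr_clean.to_numpy(), method="average")
print(has_nan, list(corr_clean.columns), linkage_matrix.shape)
```

An equivalent option is to drop zero-variance columns from the data before computing correlations, which keeps the cleaning step out of the plotting function.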
# print(create_plots(create_corr_matrix, wine_quality_dfs))
create_plots(create_clustermap, wine_quality_dfs)
Output of the recorded run (abridged): five <Figure size 1500x1500 with 0 Axes> placeholders are printed, one for each of the first five subsets (the empty figure opened by plt.figure before sns.clustermap creates its own), and the call then fails, presumably on the sixth subset (Red Wine Good), with the same error:
ValueError: The condensed distance matrix must contain only finite values.
<Figure size 1500x1500 with 0 Axes>
create_clustermap(df_red_wine)
create_clustermap(log_white_wine_df)
create_clustermap(log_red_wine_df)
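The `ValueError` above is raised by SciPy when the matrix handed to the hierarchical clustering step contains NaN or infinite values. One common source of such values is a column with zero variance, whose correlation with every column (itself included) is undefined. Whether that is the cause in this notebook is an assumption. The sketch below uses a hypothetical helper, `finite_correlation_matrix`, which is not part of the original code, to drop such columns before clustering.

```python
import numpy as np
import pandas as pd


def finite_correlation_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Correlation matrix without columns whose correlations are all undefined.

    Hypothetical helper for illustration: a constant column produces a row and
    a column of NaN in the correlation matrix, which scipy's linkage rejects.
    """
    corr = df.select_dtypes("number").corr()
    # Columns that are NaN against every other column (typically constant columns)
    undefined = corr.isna().all(axis=0)
    keep = corr.columns[~undefined]
    return corr.loc[keep, keep]


# Possible usage (assumes seaborn is imported as sns, as elsewhere in the notebook):
# corr = finite_correlation_matrix(df_red_wine)
# assert np.isfinite(corr.to_numpy()).all()
# sns.clustermap(corr, annot=True, vmin=-1, vmax=1)
```

If NaN values remain after this step, they come from somewhere else (for example missing data), and the input should be inspected with `corr.isna().sum()` before plotting.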
Conclusions from the clustermaps
- Features that are highly correlated with each other but only weakly correlated with quality are candidates for dimensionality reduction (we can use PCA to identify how many components we need).
Selecting the most important features for quality prediction using top K features¶
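As a rough sketch of the PCA idea, the snippet below standardizes the features and reports the smallest number of principal components whose cumulative explained variance reaches a chosen threshold. The function name `components_for_variance` and the 95% threshold are assumptions made for illustration, not choices taken from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, threshold=0.95):
    """Return (n_components, cumulative_ratio) for a given variance threshold.

    Hypothetical helper: features are standardized first so that columns on
    large scales do not dominate the components.
    """
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # First position where the cumulative ratio reaches the threshold
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding leaving the final value just under the threshold
    return min(n_components, len(cumulative)), cumulative


# Possible usage with the notebook's data (column name 'quality' as used elsewhere):
# n, cumulative = components_for_variance(df_red_wine.drop(columns=['quality']))
```

Strongly correlated features load onto the same components, so the more redundancy the clustermap shows, the fewer components should be needed.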
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_classif

def visualize_feature_importance(datasets, names, k=5):
    """Plot normalized ANOVA F-test scores for each (X, y) dataset side by side."""
    num_datasets = len(datasets)
    # Create subplots; squeeze=False keeps `axes` 2-D even for a single dataset
    fig, axes = plt.subplots(1, num_datasets, figsize=(15, 5), sharey=True, squeeze=False)
    for ax, (X, y), name in zip(axes[0], datasets, names):
        selector = SelectKBest(score_func=f_classif, k=k)
        selector.fit(X, y)
        # Score each feature as -log10(p-value); clip so a p-value of 0 stays finite
        pvalues = np.clip(selector.pvalues_, np.finfo(float).tiny, None)
        scores = -np.log10(pvalues)
        scores /= scores.max()
        # Plot one bar per feature
        feature_names = X.columns.tolist()
        ax.bar(range(len(feature_names)), scores)
        ax.set_xticks(range(len(feature_names)))
        ax.set_xticklabels(feature_names, rotation=90)
        ax.set_xlabel('Feature')
        ax.set_ylabel('Normalized score (-log10 p-value)')
        ax.set_title(f'{name} Feature Importance Scores')
    plt.tight_layout()
    plt.show()

# Example usage with two datasets (red wine and white wine)
datasets = [
    (df_red_wine.drop(columns=['quality']), df_red_wine['quality']),
    (df_white_wine.drop(columns=['quality']), df_white_wine['quality'])
]
visualize_feature_importance(datasets, ['Red Wine', 'White Wine'], k=5)
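The plot shows a score for every feature, but it does not say which `k` features `SelectKBest` actually keeps. A minimal sketch of how to read that off the fitted selector is shown below. The helper name `top_k_features` is hypothetical. Note that `f_classif` is an ANOVA F-test, so it treats each quality score as a separate class rather than as an ordered number.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif


def top_k_features(X: pd.DataFrame, y, k=5) -> pd.Series:
    """Return the F-scores of the k selected features, highest first.

    Hypothetical helper for illustration; X must be a DataFrame so that
    column names can be attached to the scores.
    """
    selector = SelectKBest(score_func=f_classif, k=k).fit(X, y)
    scores = pd.Series(selector.scores_, index=X.columns)
    # get_support() returns a boolean mask over the input columns
    selected = X.columns[selector.get_support()]
    return scores[selected].sort_values(ascending=False)


# Possible usage with the notebook's data:
# top_k_features(df_red_wine.drop(columns=['quality']), df_red_wine['quality'], k=5)
```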
Visualize Data¶
- Quality distribution by wine type (color)
def visualize_quality_histogram(datasets):
    """Plot a density histogram of the quality score for each (df, name) pair."""
    num_datasets = len(datasets)
    # Create subplots; squeeze=False keeps `axes` 2-D even for a single dataset
    fig, axes = plt.subplots(1, num_datasets, figsize=(15, 5), sharey=True, squeeze=False)
    for ax, (df, name) in zip(axes[0], datasets):
        quality_column = df['quality']
        ax.hist(quality_column, bins=20, edgecolor='black', density=True)
        ax.set_xlabel('Quality Score')
        ax.set_ylabel('Density')
        ax.set_title(f'{name} Quality Score Distribution')
    plt.tight_layout()
    plt.show()

# Example usage with two datasets (red wine and white wine)
datasets = [
    (df_red_wine, 'Red Wine'),
    (df_white_wine, 'White Wine')
]
visualize_quality_histogram(datasets)
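Because the quality scores are whole numbers, a histogram with 20 bins leaves empty gaps between the bars, and its density values are not directly readable as proportions. As an alternative sketch (not the notebook's method), the share of samples at each score can be computed with `value_counts` and drawn as a bar chart. The helper name `quality_proportions` is hypothetical.

```python
import pandas as pd


def quality_proportions(df: pd.DataFrame, column: str = "quality") -> pd.Series:
    """Share of samples at each distinct quality score, ordered by score."""
    return df[column].value_counts(normalize=True).sort_index()


# Possible usage with the notebook's data (requires matplotlib for plotting):
# quality_proportions(df_red_wine).plot.bar(title='Red Wine quality shares')
```

The resulting proportions sum to 1 for each wine type, which makes the red and white distributions directly comparable despite the different sample sizes.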